library(readr)
library(ggplot2)
library(dplyr)
library(methods)
library(keras)
theme_set(theme_minimal())
Let’s look again at the flowers dataset. First, we load the metadata. This is exactly the same as any other dataset in which we pull in a CSV file from GitHub:
flowers <- read_csv("https://statsmaths.github.io/ml_data/flowers_17.csv")
flowers
## # A tibble: 1,360 x 4
## obs_id train_id class class_name
## <chr> <chr> <dbl> <chr>
## 1 id_000362 valid 4 crocus
## 2 id_000506 train 6 tigerlily
## 3 id_000778 valid 9 sunflower
## 4 id_001233 train 15 windflower
## 5 id_000274 train 3 bluebell
## 6 id_001218 train 15 windflower
## 7 id_001280 test NA <NA>
## 8 id_000895 train 11 colts foot
## 9 id_000851 valid 10 daisy
## 10 id_000084 train 1 snowdrop
## # … with 1,350 more rows
Then, we also have to grab the image data itself. To do this, first download the dataset here:
Save it somewhere on your computer and then read it into R:
x64 <- read_rds("image_data/flowers_17_x64.rds")
I again only want to look at the first 10 types of flowers.
x64 <- x64[flowers$class %in% 0:9,,,]
flowers <- flowers[flowers$class %in% 0:9,]
fnames <- flowers$class_name[match(0:9, flowers$class)]
fnames <- factor(fnames, levels = fnames)
If we want to improve our model further beyond dense neural networks, we need to include information beyond just the color of the flower. When we look at the images, our brains also use information about shape and texture. Let’s try to find a way to measure this in the image.
I will start by taking a sample flower image and creating a black and white version of it. A simple way to do this is to average the red, green, and blue pixels.
i <- 50
bw <- (x64[i,,,1] + x64[i,,,2] + x64[i,,,3]) / 3
plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
rasterImage(bw,0,0,1,1)
To detect texture we can take the brightness of each pixel and subtract it from the brightness of the pixel to its lower right. We can do this in a vectorized fashion as such:
edge <- abs(bw[-1,-1] - bw[-nrow(bw),-ncol(bw)])
plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
rasterImage(edge,0,0,1,1)
The resulting image roughly detects edges in the image. Notice that is has only 63-by-63 pixels due to the fact that we cannot compute this measurement on the rightmost or bottommost edges of the plot.
We’ll do this for each image, and save the number of pixels that have an edge value greater than 0.1. You could of course play around with this cutoff, or save a number of different cutoff values. This number will tell us roughly how much of the image consists of edges. A low number indicates a smooth petal and a a high one indicates a grassy texture to the flower.
mean_edge <- rep(0, nrow(flowers))
for (i in seq_len(nrow(flowers))) {
bw <- (x64[i,,,1] + x64[i,,,2] + x64[i,,,3]) / 3
edge <- abs(bw[-1,-1] - bw[-nrow(bw),-ncol(bw)])
mean_edge[i] <- mean(edge > 0.1)
}
A boxplot shows that there are differences between the flowers in this measurement. Crocuses in particular have a lot of edges.
qplot(flowers$class_name, mean_edge, geom = "blank") +
geom_boxplot() +
coord_flip() +
theme_minimal()
Most of the photos have a flower in the middle, but the background may include grass, sky, or other non-related elements. Let’s repeat the edge detector but now only such as the degree of edge-ness only for the middle of the image.
mean_edge_mid <- rep(0, nrow(flowers))
for (i in seq_len(nrow(flowers))) {
bw <- (x64[i,,,1] + x64[i,,,2] + x64[i,,,3]) / 3
edge <- abs(bw[-1,-1] - bw[-nrow(bw),-ncol(bw)])
mean_edge_mid[i] <- mean(edge[20:44,20:44] > 0.1)
}
This shows a clearly differentiation of the flowers. Fritillary have a lot of edges due to their spots in the middle of the photo. Notice that the patterns here are quite different from those in the whole image.
qplot(flowers$class_name, mean_edge_mid, geom = "blank") +
geom_boxplot() +
coord_flip() +
theme_minimal()
We will create a data matrix by putting together the color information with the mean_edge and mean_edge_mid metrics.
color_vals <- c(hsv(1, 0, seq(0, 1, by = 0.2)),
hsv(seq(0, 0.9, by = 0.1), 1, 1))
X_hsv <- matrix(0, ncol = length(color_vals),
nrow = nrow(flowers))
for (i in seq_len(nrow(flowers))) {
red <- as.numeric(x64[i,,,1])
green <- as.numeric(x64[i,,,2])
blue <- as.numeric(x64[i,,,3])
hsv <- t(rgb2hsv(red, green, blue, maxColorValue = 1))
color <- rep("#000000", nrow(hsv))
index <- which(hsv[,2] < 0.2)
color[index] <- hsv(1, 0, round(hsv[index,3] * 5) / 5)
index <- which(hsv[,2] > 0.2 & hsv[,3] > 0.2)
color[index] <- hsv(round(hsv[index,1],1), 1, 1)
X_hsv[i,] <- table(factor(color, levels = color_vals))
}
X_edge <- cbind(X_hsv, mean_edge, mean_edge_mid)
y <- flowers$class
X_train <- X_edge[flowers$train_id == "train",]
X_valid <- X_edge[flowers$train_id == "valid",]
y_train <- y[flowers$train_id == "train"]
y_valid <- y[flowers$train_id == "valid"]
library(glmnet)
model <- cv.glmnet(X_train, y_train, family = "multinomial",
alpha = 0.2)
plot(model)
I’ve included the cross-validation curve because it is a perfect textbook example of what the curve should look like (but rarely does so nicely). The resulting model performs much better than the color alone.
pred <- as.numeric(predict(model, newx = X_edge,
type = "class"))
tapply(pred == y, flowers$train_id, mean)
## train valid
## 0.620 0.455
A confusion matrix shows us that only a few flowers are still difficult to differentiate.
table(pred = fnames[pred[flowers$train_id == "valid"] + 1],
y = y[flowers$train_id == "valid"])
## y
## pred 0 1 2 3 4 5 6 7 8 9
## daffodil 7 1 1 1 0 1 0 2 1 5
## snowdrop 3 12 5 1 1 1 3 5 1 0
## lily valley 1 2 12 3 0 0 5 3 1 1
## bluebell 0 0 0 8 1 5 0 0 2 0
## crocus 0 1 0 3 15 2 1 3 0 0
## iris 0 0 1 0 0 7 0 0 2 0
## tigerlily 2 3 1 1 2 0 9 3 2 0
## tulip 5 1 0 0 0 1 1 1 2 3
## fritillary 0 0 0 1 1 2 0 0 9 0
## sunflower 2 0 0 2 0 1 1 3 0 11
The next step would be to figure out what features would help distinguish the “snowdrop”, “daffodil”, and “bluebell” flowers from the others as false positives and negatives from these groups are causing a large portion of the remaining errors.